Sketching methods with small window guarantee using minimum decycling sets

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a Decycling Set, an unavoidable sets of k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, it is unavoidable). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger, and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.


Introduction
Sketching methods, such as minimizers [21] or open-syncmers [4], distill a long sequence into a smaller "sketch," a set of k-mers and their positions in the sequence.By comparing these sketches, it is possible to quickly estimate whether two sequences are similar and may have a good quality alignment between them, or not.Because sketching methods greatly reduce the computational needs in many genomics algorithms with usually little impact on the quality of the result, they are used in many computational biology software packages (see [30] for a review).
For our purposes, a k-mer sketching method is modeled by a function φ that takes a context as an input (a substring of the input sequence of fixed length c) and outputs a set of positions within the context of the selected k-mers.The output of φ can be the empty set, meaning that nothing is selected in this context.The sketch M φ (S) for a sequence S is the union of all selected positions over all the contexts of S (see Section 2).This sketch contains a subset of all the k-mers in S as the function φ might not pick any k-mer in a context or adjacent contexts may pick the same locations.
The two properties of sketching methods that downstream applications rely on to prove correctness are: Locality The property that similar sequences (i.e., that have reasonably long identical subsequences) will have common elements in their sketches, and hence long enough matches will be detected using the sketches.This is naturally satisfied because the selection is done using a function (φ), therefore two sequences that share an exact substring of length at least c will select the same k-mers in that context.Window guarantee The maximum distance w between two selected k-mers is the window size or guarantee.A small window size guarantees that no large part of a sequence is ignored.Equivalently, the window property means k-mers are selected at approximately regular intervals.Sketching methods are usually optimized for two metrics, density [24] and conservation [4].The density is the relative size of the sketch, formally defined as | M φ (M )|/|S|.A lower density is desirable as a smaller sketch usually implies less computation and lower memory requirements.The conservation is the proportion of elements that are common between a sketch of S and a sketch of a slightly mutated sequence S ′ , where the common elements are either k-mers or subsequences covered by these k-mers.Higher conservation is desirable because it usually correlates to higher sensitivity to detect sequence similarities in the face of mutations and errors.For a fixed k, a smaller context size leads to higher conservation, as the presence of a k-mer in the sketch of the mutated S ′ may be affected by mutations in the entire context [25].
Not all sketching methods satisfy the window guarantee property (i.e., for some sketching methods, there are infinitely long sequences S with an empty sketch; see Section 3).However, sketching methods that do not satisfy the window property are problematic in two ways.First, most algorithms using a sketching method do not have a proof of correctness in cases without the window property (e.g., an aligner may miss arbitrarily long, good quality alignments, preventing claims of sensitivity).
Second, the sketch optimization problem is ill-formed without the window property.The empty selection function that returns the empty set for any input sequence satisfies vacuously the locality property, it has perfect conservation, and it has the lowest possible density.But of course, no information is preserved in an empty sketch and this trivial solution is not useful.The existence of trivial solutions is not a purely theoretical concern.When optimizing sketching methods using machine learning, almost empty (and not practically useful) solutions are found if no window constraint is used in the loss function [10].
A set of k-mers M is unavoidable if any infinitely long sequence must have k-mers from M .
Because any sequence uniquely corresponds to a path in the de Bruijn graph D k of order k, an equivalent point of view is the decycling sets (DS): M is an unavoidable set of k-mers (and a decycling set) if and only if D k \ M , the de Bruijn graph D k with the k-mers from M removed, is a directed acyclic graph (DAG).
There is a strong two way connection between such decycling sets and sketching methods with a window guarantee.Consider the set M φ of possibly selected k-mers (the union of all k-mers selected over every possible context) for sketching method φ.If the sketching method has a window guarantee, then M φ is a decycling set.Moreover, the window size of φ is equal to the remaining path length of M φ , i.e., the length of the longest path in the DAG D k \ M φ .
The function φ of a sketching method with the smallest possible context (c = k, aka contextfree methods, such as syncmers) is equivalent to the indicator function of its set M φ : as the input context contains only one k-mer, the output of φ is not empty exactly when the input k-mer is in M φ .A sketching method with a larger context may not select every occurrence of k-mers in M φ from S. For example, a context may contain multiple k-mers from M φ but the function φ only selects one of them [2].In other words, given two sketching methods, one context-free and one with a context, having the same set of possibly selected k-mers, the method with a context can lower its density at the expanse of having a lower conservation.Conversely, given a decycling set M , the indicator function of M defines a context-free sketching method with a window guarantee.
This connection between decycling sets and sketching methods suggests, first, that the properties of the decycling sets ultimately define the properties of the associated sketching method.In other words, by studying the space of decycling sets we gain insights into the design space of sketching methods.Second, the space of decycling sets is much larger than the decycling sets generated by the few families of sketching methods currently used.Rather than creating ad hoc sketching methods, a promising strategy is to find a decycling set with desirable properties and use the sketching method associated with this set.
In this study we focus on minimum-size decycling sets (MDS).MDSs provide a logical starting point for the study of decycling sets.First, the MDSs are by definition as small as possible, therefore reducing as much as possible the cost of storing and querying such a set.Second, these sets are likely to have short remaining path lengths, corresponding to sketching methods with small window guarantee.
After describing the window guarantee of common sketching methods, we describe the structure of the de Bruijn graph and of its cycles.We then give two simple graph operations that can be used to enumerate MDSs.Provided Conjecture 1 is true (for which we provide ample theoretical and experimental evidence), all MDSs can be reached with these operations.Using these operations we design an optimization procedure to find MDSs with short remaining path lengths.This optimization procedure gives further insight on the range of possible window guarantee for sketching methods and on the of the well-known Mykkeltveit set.
The conjectures and optimization methods proposed here are the basis to further the understanding of MDSs and the design space of the sketching methods that are central to computational biology algorithms, in particular sketching methods with a small context and a strong window guarantee.

Preliminaries and notations
An alphabet is a small set Σ of size σ = |Σ|.Although the results generalize to any alphabet size, we consider the binary alphabet Σ = {0, 1} and the DNA alphabet {A, C, G, T } of size 4.A sequence S is an element of Σ * , and sequences are indexed starting at 1. S[a : k] represent the subsequence starting at position a of length k, i.e., the ath k-mer of S.
A sketching scheme is defined by its selection function φ : Σ c → P([c − k + 1]), where P denotes the power set.The contexts of S are all the subsequences of length c: S The sketch of S is the set of the positions of the selected k-mers in S: The set of all possibly selected k-mers for the sketching method φ is The de Bruijn graph of order k is the directed graph D k = (Σ k , E k ), where each k-mer is a node and the edges u → v represent the suffix-prefix relationship u[2 . The de Bruijn graph is σ-regular, Eulerian and Hamiltonian.For convenience, short strings, such as k-mers, are commonly represented as based-σ numbers.

Window guarantee of existing sketching schemes
We review sketching methods commonly used in computational biology and evaluate their window guarantee.
Hash-based methods.Hash methods use a hash function h and select the k-mers m that satisfy, for example, h(m) = 0 mod p or h(m) < t for some predefined constants p, t [13,5].Effectively the hash function randomizes the k-mers and the criteria selects a subset of the k-mers.Other methods apply a sketching method like minimizers or syncmers and further down-sample the sketch using a hash function [23,4].
In general these methods do not have a window guarantee and, historically, this was one of the motivations for Schleimer [24] to introduce the winnowing scheme (which is equivalent to minimizers).Although these schemes can have low density and have a short context (c = k), it is achieved at the cost of having no window guarantee.For example, by choosing low values of the threshold t, the density can be made arbitrarily low, but the number of distinct cyclic sequences not covered by the scheme increases dramatically.
Window-based methods.These methods always pick at least one k-mer in each context, therefore the context and the window guarantee are closely linked.
The minimizer scheme has three parameters (k, w, O) and in each window of w consecutive k-mers (i.e., the context is a substring of length w + k − 1), the selection function returns the position of the smallest k-mer according to the order O [21,22].There are many ways to select the order O [29,27,11,12], for example to improve the density, but because the selection function never returns the empty set, all these methods have a window guarantee of w, independent of the choice of O.
The density of minimizers schemes is usually between 1.5/(w + 1) and 2/(w + 1) [16,15], and the context length is c = w + k − 1. Density can be lowered by increasing w, although this increases the context length (hence weakens the locality and lowers the conservation).Having a coupling between the window guarantee and the context length constrains the parameter choices for minimizer schemes.
Compared to minimizers, the minmers scheme [14] adds a fourth parameter d: in each window of dw consecutive k-mers the selection function returns the position of the d smallest k-mers according to O. Minmers achieve a density closer to 1/w while having a significantly longer context of dw + k − 1.
Positional minimums.Under this generic name are methods such as open-syncmers [4], masked minimizers [10] and parameterized syncmers [3].These schemes have four parameters (k, s, O, m) where s ≤ k and m is a non-empty bit-mask of length k.A context of length c = k is selected if the smallest s-mer in the context (choose left-most to break ties) is at position i and bit i is set in the mask m.
Whether these schemes have a window guarantee depends on whether the first bit of m is set.If the first bit is set and a k-mer is selected, then this implies that an s-mer at position i > 1 is strictly smaller than the s-mer at position 1, forming a decreasing list of s-mers.As the k-mers are shifted along the sequence, this decreasing list of s-mers must eventually come to an end, hence there is a window guarantee.This window guarantee is weak as the window can be as long as σ k−1 (see Supplementary Material 1).
If the first bit is not set, because of the left-most tie breaking rule, there is no window guarantee.Hence, these methods have a short context and a weak or missing window guarantee.

Cycle structure of the de Bruijn graph
There exists two methods to generate decycling sets of minimum size by Mykkeltveit [17] and Champarnaud [1].These algorithms are of great theoretical importance as they settled a conjecture of Golomb [9] on the size of an MDS.They are also practical algorithms as membership in these MDSes is testable in time and memory polynomial in k (i.e., the entire set does not need to be precomputed and stored).But, as we shall see, the space of all MDSs is much larger than these two MDSs.
We provide a method that uses only two simple graph operations-called F-move and I-movethat transform an MDS into another MDS.Furthermore, we conjecture that these two operations are sufficient to enumerate all MDSs.In other words, given a graph where the nodes are all the MDSs and the edges represent these operations, Conjecture 1 states that this graph is strongly connected.We give theoretical and computational evidence to support this conjecture.
This section describes the structure of the cycles in the de Bruijn and how through these two operation MDSs interact with the cycles.Although these two operations are similar in nature and together they might enumerate all MDSs, we describe them separately as they have qualitatively distinct effects on the MDSs (see Proposition 2 and Conjecture 2).
A pure cycling register (PCR), aka a conjugacy class, is a cycle in the de Bruijn graph made of the circular permutation of a k-mer.For example, the PCR of the 4-mer 1011 over the binary alphabet is 1011 → 0111 → 1110 → 1101 → 1011.The PCRs form a partition of the k-mers and therefore any MDS must contain at least one k-mer from each PCR.We call a k-mer set with exactly one k-mer in each PCR a PCR set.The theorems of Mykkeltveit [17] and Champarnaud [1] show that every MDS is a PCR set.On the other hand, not every PCR set is an MDS.

F-moves
The left-companions (resp.right-companions) is the set of k-mers that have the same suffix (resp.prefix).Given f ∈ Σ k−1 , then lc(f ) ≜ {af | a ∈ Σ} are the left companions sharing the suffix f , and rc(f ) ≜ {f a | a ∈ Σ} are the right companions.See Figure 1 for examples.If f = a k−1 , then the k-mers af and f a are equal (homopolymer a k ), and this k-mer is both in the left-and right-companion sets for f .The homopolymers are the only such k-mers.Every other k-mer is a left-companion for exactly one suffix and a right-companion for a different prefix. .For any f ∈ Σ k−1 there are 1 F-move, 1 RF-move and 2 σ − 2 I-moves possible, unless f is a homopolymer.

Proposition 1 (Existence of F-moves)
In any MDS M , there exists f, f ′ ∈ Σ k−1 such that M contains the left companions of f and the right companions of f ′ .
Proof.By contradiction, assume there is no such f ′ .Color all the nodes of the graph blue and do a random walk in the graph, starting from any node not in M , avoiding the nodes in M .Color in red the nodes traversed.Any k-mer m is the left-companion of a suffix, say f m , and every outgoing edge from m is an incoming edge to a right-companion of f m (see Figure 1).Because no right-companion sets are in M , it is always possible to continue the walk avoiding M from any m.Given that the graph is finite, the red nodes will eventually create a cycle, contradicting M being a decycling set.The same reasoning applies for the existence of f traversing edges in the reverse direction.

■
An F-move (named after Fredricksen [7]) in M for f ∈ Σ k−1 is the operation of changing the set of left-companions of f for the set of right-companions, as shown in Figure 1.We use the functional notation f M to designate the set obtained by the valid F-move This is a valid operation only when M contains lc(f ).As a consequence of Proposition 1 there always exists a valid F-move in an MDS.The RF-move (reverse F-move) is the inverse operation, valid when Proposition 2 (F-moves preserve decycling sets) Let M be an MDS such that lc(f ) ⊂ M , then f M is also an MDS.
Proof.If there is a cycle that avoids f M , then it must use one of the nodes in lc(f ), otherwise it was already a cycle avoiding M .Any cycle using a node in lc(f ) then must use a node in rc(f ) ⊂ f M .

■
An analogous statement holds for RF-moves.F-moves give a procedure to enumerate some MDSs, starting for example from either the Mykkeltveit or Champarnaud set and repeatedly applying a (guaranteed-to-exist by Prop. 1) F-move.Unfortunately, not all MDSs are reachable using only F-moves.The MDS graph G MDS (σ, k) has all the MDSs as nodes and edges that represent F-moves operations between MDSs.G MDS is not connected, as seen in Figure 2, but its components have a well characterized structure (proof in Supplementary Material 2).
Proposition 3 (G MDS component structure) For any σ and k, the components of G MDS (σ, k) satisfy: 1. every component is strongly connected 2. every cycle is of length

I-moves
An I-move, as in an "incomplete F-move", is valid when M contains a mixture of left-and rightcompanions: for some f ∈ Σ k−1 and ∀a ∈ Σ, either af or f a is in M .See Figure 1 for an example.For a given f ∈ Σ k−1 , there are 2 σ − 2 distinct I-moves: one for each possible choice of left-companions nodes in M , excluding the F-move (all of lc(f )) and the RF-move (none of lc(f )).
There is one exception: when f = a k−1 is a homopolymer, af = f a is both in lc(f ) and rc(f ) and the number of possible I-moves for is interpreted as a bit-mask giving the nodes from lc(f ) (i.e., the ath bit An identical argument as for Proposition 2 shows that applying a valid I-move to an MDS also gives an MDS Although F-moves and I-moves seem like similar operations and both preserve MDSs, they have distinct effect on MDSs.First, empirically we observe that I-moves, unlike F-moves, are not always possible.MDSs always have a valid F-move (Proposition 1), while an MDS may not have any valid I-move.All of the σ k−1 F-moves are an edge in every component of the MDS graph, while not all of the σ k−1 • (2 σ − 2) I-moves are valid in at least one MDS of the entire MDS graph.In particular, no MDS for σ = 2 and k = 5 have any valid I-move.
Second, F-moves not only preserve the decycling property of MDSs, but they also preserve the "coverage" of every cycle by an MDS.To make this notion precise, define the hitting number of a cycle C of D k by the MDS M as the size of their intersection: PCRs for example have a hitting number of 1 while any Hamiltonian cycle has a hitting number equal to |M |.
Furthermore, the cycle signature of MDS M is the vector of all hitting numbers for all possible cycles: . Per the following proposition, F-moves preserve hitting numbers and signatures, while I-moves do not.

Proposition 4
1. Let M be an MDS and f a valid F-move in M , then for any cycle C, Proof.Let f be a valid F-move in MDS M , and C be a cycle of D k .Because every outgoing edge of a node in lc(f ) is an incoming edge to a node in rc(f ), C must contain as many nodes from lc(f ) as from rc(f ) (which can be 0).Before the F-move, all the nodes from lc(f ) and none from rc(f ) are in M , while the opposite is true for f M .Hence the hitting number is unaffected by the F-move, proving 1.
Let f | m be a valid I-move in M such that af ∈ lc(f ) and [26], there exists a path P from f b to af that avoids cf, c ∈ Σ \ {a}.Path P followed by edge af . By the same construction, there exists a "complementary" cycle C ′ using bf and f a such that As a component of G MDS is strongly connected by F-moves, statement 3 is a direct consequence of 1.A proof for 4 is given in Supplementary Material 3.

■
As a consequence of this proposition, the hitting number and signature are constant over a component of the MDS graph, and the hitting number H χ (C) and the signature S(χ) are well defined for a component χ.Because an I-move changes the signature, every I-move links MDSs from different components.Consider now the component graph G comp (σ, k) with one node for each component of G MDS and a directed edge from component χ 1 → χ 2 if there is an I-move from an MDS M 1 ∈ χ 1 to M 2 ∈ χ 2 .In fact, as stated in the following Proposition, G comp is an undirected graph (proof in Supplementary Material 4).

Enumerating all MDSs
We make the following two conjectures regarding the use of I-moves to enumerate all MDSs.

Conjecture 1 (Connectivity by I-moves)
The G comp graph is connected.Equivalently, every MDSs is reachable from the Mykkeltveit MDS using a sequence of F-moves and I-moves.
This conjecture is supported by the previous theoretical results, in particular that all the components have a different signature and that the I-move always change the signatures.For reasonable values of k (σ = 2, k ≤ 7), it is computationally feasible to enumerate all PCR sets and check which of them are also decycling sets.Using this brute force method we can confirm that The following conjecture is also verified up to k = 7 and exposes another fundamental difference between F-moves and I-moves.Every F-move is always valid in every component, while the valid I-moves identify a component (similarly to the cycle signature).For a component χ, let the list of I-moves be Conjecture 2 (I-move signature) Every component in G MDS has a distinct list of valid I-moves.
The validity of this second conjecture is likely related to the previous one.To prove Conjecture 1, one needs to show that for any two components χ 1 , χ 2 there is a path of I-moves to go from χ 1 to χ 2 .Conjecture 2 can be used as a guide to find that path: because I(χ 1 ) ̸ = I(χ 2 ), then there exists a valid I-move in either I(χ 1 ) \ I(χ 2 ) or I(χ 2 ) \ I(χ 1 ).(Note that it is possible to have, for example, I(χ 1 ) ⊂ I(χ 2 ).) Do that I-move and repeat with the new components.Although in our testing Conjecture 2 is useful to find a path from χ 1 to χ 2 , it is not sufficient as it does not guarantee that the size of the difference between the I-move lists is decreasing.
To create Table 1 we use both conjectures: one to traverse the graph and the other to avoid enumerating a component more than once.

Non-decycling PCR sets.
Non-decycling PCR sets may also have valid F-moves and I-moves, but there are significant differences with MDSs.Unlike MDSs (see Proposition 1), a non-decycling set it is not guaranteed to contain sets of left-and right-companions.Even more, the analog graph to G MDS with nondecycling PCR sets as nodes and F-moves for edges is a non-connected graph where each component is a DAG (see Figure 2 and Supplemental Material 5).There cannot be any F-moves between an MDS and a non-decycling set.On the other hand, there can be an I-move from a non-decycling set to an MDS (but not the other way around).

Remaining path length and window guarantee
By traversing the component graphs and the MDS graph, one can search for MDSs with desirable properties.Unfortunately, as seen in Table 1, every aspect of these graphs (i.e., number of MDS, number of components, layer size, etc.) seem to have super-exponential growth.Enumerating all MDSs for k ≥ 9 with the binary alphabet is likely not reasonable, and for the DNA alphabet it is even more difficult.In this section, we provide some methods to explore the space of MDSs more efficiently and study the window guarantee of MDSs.

Efficiently traversing the component graph.
As is seen in Table 1, the number of MDSs and components is increasing quickly with k, although an actual estimate of the growth as a function of k is not known.The memory used to traverse a component can be reduced by noticing that each component is partitioned into σ k−1 layers with edges only from one layer to the next (see Figure 2).Therefore, it is only necessary to keep in memory the MDSs of the current and next layer to exhaustively enumerate every MDSs in the component.
As each component contains at least one cycle of length σ k−1 , the number of MDSs grows by at least a factor of σ k−1 faster than that of components.In fact, it grows much faster as each of the σ k−1 layers has a size that grows fast with k as well (see Table 1).While the number of MDSs and the size of the layers varies significantly between components, in general it is not efficient to traverse an entire component to find all the valid I-moves.Using the following proposition, it is possible to find all the valid I-moves in a component by considering only one MDS.
Given an MDS M , any cycle C satisfies H M (C) ≥ 1.The cycles with a hitting number of exactly 1, called constrained cycles, play an important role in the existence or not of a valid I-move: an I-move is only valid if there is no constrained cycle using edges of the I-move.This proposition, proved in Supplementary Material 6, shows that to find the list of valid Imoves in the entire component it is sufficient to find the edges not covered by a constrained cycle in just one of the MDS of the component.This holds, as by Proposition 1, the list of constrained cycles is constant across the MDSs of a component.Moreover, tagging the edges covered by constrained cycle can be done with one depth-first search for each k-mer in the MDS.The main advantage of this method is its run time is independent of the number of MDSs in the component.

Remaining path length
The remaining path length of an MDS M is the length of the longest path in the DAG obtained by removing the k-mers of M from D k .Given a selection scheme that selects in a sequence the k-mers from M , the remaining path length is precisely the window guarantee of the scheme.The following proposition gives bounds on the effect of an F-move or I-move on the remaining path length (see Figure 3).
Proposition 7 An F-move or RF-move can increase or decrease the remaining path length by at most 1.An I-move can increase the remaining path length by at most 1 or decrease it by half.
Proof.First, notice that the longest path in D k \ M must start at a valid F-move and end at a valid RF-move.Let P = (m 1 , . . ., m n ) be a longest path.The k-mer m 1 is the right-companion of some suffix f .Suppose there exists a ∈ Σ such that af / ∈ M , then the path P ′ = (af, m 1 , . . ., m n ) avoids M and is longer than P , contradicting its maximality.Therefore lc(f ) ⊂ M and f is a valid F-move in M .The proof is symmetrical for m n as the left-companion of some prefix f ′ with rc(f ′ ) ⊂ M .
Because m 1 ∈ f M , the path P is shortened by 1 by the F-move f , which may shorten the longest path if there was no other paths of that length.Also, rc(f ) ⊂ f M (i.e., f is a valid RF-move in f M but it was not in M ), hence there might be maximal path P ′ ending at a left-companion of f with |P ′ | > n.Because the F-move only moved nodes forward by one edge, |P ′ | ≤ n + 1 and the longest path may have increased by 1.The same argument applies to an RF-move.
For a valid I-move f ′′ | m in M , the same reasoning applies for increasing by 1.On the other hand, a longest path may have used an edge af ′′ → f ′′ b where M and the path is now broken in up to two parts: (m 1 , . . ., m i ) and (m i+2 , . . ., m n ).Therefore the remaining path length could be halved if i = n/2.

■
Based on this, we implemented a simulated annealing algorithm to find the smallest and largest remaining path lengths among MDSs.The longest path for the MDS M is computed using a modified topological sort of the DAG D k \ M .Supposed we are computing the smallest remaining path length.Starting from a component of the MDS graph, the program performs a fixed number of random F-moves (2k by default) and computes the remaining path length for each MDS and keeps the minimum.Then, it finds all the valid I-moves in the current component as explained in Section 5.1, and it picks one at random.
After performing the I-move, in the new component, the remaining path length is computed for 2k MDSs reachable by F-moves and a new minimum is computed.If this new minimum is lower than the previous minimum, then the new component becomes the current component.Otherwise, it becomes the current component only with some small probability.Then the process is repeated from the current component for a fixed number of iterations.As is traditional with simulated annealing, the probability to jump to "worse" components decreases over time.
Table 2 shows the remaining path length for the two previously known algorithms to generate MDSs and the range of remaining path length.These ranges are either exact when an exhaustive list of MDSs is computable, and approximated using simulated annealing otherwise.Based on the pattern that the Mykkeltveit set is always at or close to the minimum remaining path length, we conjecture that it holds for all parameter k and σ.

Per-component remaining path length
Proposition 7 gives a bound to the change in the remaining path length as the MDS graph is traversed using F-moves and I-moves.Within one component, given that every MDS is in a cycle of length σ k , the remaining path length along this cycle could change by up to σ k /2.In other words, this proposition only gives an exponential bound on the range of remaining path length within a component.The graph in Figure 3 has a point for each component at the coordinate (m P (χ), M P (χ)) where m P (χ) is the minimum of the remaining path length over all the MDSs of the component χ, and M P (χ) is the maximum.The vertical distance from the diagonal y = x represents the range of remaining path lengths within a component.We observe for k ≤ 8 on the binary alphabet that the range is bounded by O(k).

Conjecture 4 Within a component of G MDS , the range of remaining path length is O(k).
There are plausible reasons for having such a small range.Consider two extremes: (1) there are many F-moves and RF-moves valid at the same time in an MDS M , (2) there is only 1 F-move and 1 RF-move valid in M .In the first case, doing one of these F-moves or RF-moves affects the maximal paths that start or end at these moves.Consequently, many of these moves change the length of paths that are not the longest.In other words, these moves have no effect on the remaining path length.In the second case, it is possible to show that doing the 1 valid F-move does not change the remaining path length (the longest path is truncated by its first node and extending by one node, hence not changing in length).This type of situation is likely to happen when there are few F-moves and RF-moves possible.In both cases, most F-moves do not affect the remaining path length.
This conjecture partially justifies only exploring O(k) MDSs within one component in the simulated annealing algorithm in Section 5.2.

Discussion
Proportion of MDSs.A simple algorithm to generate a random MDS, sampling the space of MDSs uniformly, is to select at random k-mer from each PCR and check whether it is decycling, and to resample if not.Even though the space of MDSs is (maybe surprisingly) large, it is nonetheless only a tiny fraction of the PCR sets.The number of PCR sets is easily computable [8] and asymptotically there are Ω(k σ k /k ) PCR sets.There is no formula for the number of MDSs, but based on the numbers from Table 1, for k = 8 of the 2 × 10 29 PCR sets the proportion that are , one of the left-companion of f in solid gray is missing, then it could be extended to the left, contradicting maximality.Doing F-move f (changing solid gray for dashed nodes) can shorten the longest path by 1 node.Also, after doing F-move f , a path now ending in one of the solid gray node could be the longest and was extended by 1 node.If the path goes through an I-move f ′′ | m , then doing the I-move cuts the path in two possibly equal parts.Right: Comparison of the minimum and maximum remaining longest path for components of G MDS (2, k) for 4 ≤ k ≤ 8.Each point represents one connected component of the graph.The minimum and maximum remaining path lengths are computed over all the MDSs of a component.Therefore, the vertical distance of a point from the diagonal y = x (in yellow) shows the variation of remaining path length within a component.For k = 8, a subsample of 500 components were examined, as the total number of components is exceedingly large.The lines are drawn to depict the bounds of the increase between components.In all cases seen, the difference between the minimum and maximum remaining length within a component is in some range [α, α + k] for an alpha that is less than k.
MDSs is only 2 × 10 −18 .For k = 9 that proportion is essentially 0. Thus, the random sampling method is not of any practical use.
In that sense Conjecture 1, provided it is true, is an efficient method to enumerate all MDSs as only MDSs are ever considered without the need to filter out an overwhelming number nondecycling sets.Even if this conjecture is eventually proven wrong, the F-moves and I-moves allow us to explore a large subspace of MDSs, and, using simulated annealing or more advanced machine learning methods, to find MDSs with desirable properties.
Moreover, on the theoretical side, providing evidence for this conjecture lead us to a deeper understanding of the space of MDSs and to formulate other useful conjectures.
Mykkeltveit set and short windows.It is surprising (or lucky) that the first algorithm for constructing MDSs by Mykkeltveit [17] gives a set with close to the shortest remaining path length.This fact may explain retrospectively the success of previous methods using this set as the starting point to design minimizers schemes [18,19,6,20].The growth of the remaining path length for the Mykkeltveit set is well characterized [28]: it is Ω(k 2 ) and O(k 3 ).Fitting the data from Table 2 we obtain an exponent of 3.12±0.14,suggesting an actual growth of O(k 3 ).Provided that Conjecture 3 holds, this would answer the question of the shortest window guarantee that is possible using an MDS.For comparison, fitting the Champarnaud data gives an exponent of 6.1 ± 0.59.
Longest remaining path length.Conjecture 4 only suggests a bound on the range of remaining path length within a component of G MDS .A legitimate question is what is the bound of the range in G MDS as a whole.Figure 3 could suggests that this range is polynomial in k, although the trend in this figure is much too short to elevate this statement to a conjecture.Given the known results bounding the longest remaining path of the Mykkeltveit set by O(k 3 ), this would mean a polynomial bound on the remaining path length of MDSs.
This statement seems counterintuitive at first (and is, of course, not proven).We saw in Section 3 that syncmers have a window guarantee of σ k−1 , hence there exists DSs that are not of minimum size that have exponentially long remaining paths.How then can sets with fewer k-mers (MDSs) have a shorter remaining path length?The intuition is as follows.In the syncmers construction, we chose one exponentially long path (length σ k−1 − 1) through the graph while every node not on this path is added to the DS M .The size of the DS |M | = σ k (1 − 1/σ) is exponential as well: it takes many nodes, guiding that long path, to prevent cycles.On the other hand, the size of an MDS is ∼ σ k /k, which is o(σ k ).The average remaining path length is k and there are too few k-mers in an MDS to guide an exponentially long path to prevent it from creating cycles (i.e., to have back edges).

Conclusion
The window guarantee is an important requirement, theoretically and practically, to define and optimize sketching methods.As discussed, an underlying concept that can be extracted from the definition of this guarantee in any local sketching method is a set of nodes in the de Brujin graph which are unavoidable (i.e., decycling).While many such sets exist, the minimum-sized sets have important properties that can be exploited and examined.In this work, we described some of the first theoretical findings on properties of these sets, as well as a method to traverse many (if not all) MDSs for a given k-mer length.We also showed that the choice of MDS, whether direct or as an implication of the design of the sketching method, does have an impact on the strength of the window guarantee.Although we provide our major results as conjecture, we present significant evidence to support these claims.
■ By extension, in a chain of F-moves, reordering the F-moves, as long as it is valid, does not change the result.Note that there is no equivalent statement for I-moves: if f 1| m 1 , f 2| m 2 are two valid I-moves in M , then f 2| m 2 may not be valid in f 1| m 1 M .In the following proofs, we use the simplified representation for PCRs, F-and I-moves given in Figure 4.For simplicity, the figure shows an example with the binary alphabet.When σ > 2, an F-move f represents a hyperedge between σ PCRs rather than a simple edge as shown.
Proposition 3 (G MDS component structure) For any σ and k, the components of G MDS (σ, k) satisfy: 1. every component is strongly connected 2. every cycle is of length ασ k−1 , α ∈ N 3. in a cycle of length ασ k−1 , every possible F-move f ∈ Σ k−1 occurs exactly α times 4. every node is in a cycle of length σ k−1 (hence the girth is Proof Points 2 and 3, length of cycles.Every PCR is a cycle in D k and an MDS M is seen as pebbles sitting on the k-mers (see Figure 4 b) There is one pebble per PCR.An F-moves involves σ distinct PCRs (edges af → f a, a ∈ Σ are each in their own PCR).Hence an F-moves is an hyperedge connecting σ PCRs.An F-move is like moving the pebbles along σ PCRs at a time, from left-companions to right-companions, and this move is legal only if lc(f ) ⊂ M .In that sense, an F-move is like a semaphore: pebbles can move only if all their left-companions are present in the set.First, because every MDS has a valid F-move and a component of G MDS is finite, a component must have a cycle.Let C = (M 0 , . . ., M n−1 ) be a cycle of MDSs in G MDS , and equivalently Figure 4: Simplified representation of PCRs, F-moves and I-moves when σ = 2. a) shows two PCRs from the de Bruijn graph D k .Every k-mer is a circle, and they are all oriented counter-clock-wise (see PCR P 1 and P 2 here).Let f be an F-move that involves P 1 , P 2 .Here P 1 has the edge 0f → f 0, and P 2 has 1f → f 1: these are the PCR edges.The cross-PCR edges 0f → f 1 and 1f → f 0 form anti-parallel edges between P 1 and P 2 .b) The simplified PCR/pebbles representation shows PCRs as large cycles without representing individual k-mers and only representing the F-move edges of interest.The elements from the MDS in each PCR (the pebbles) are small black circles that can travel only counter-clock-wise around the PCR.An F-move is an edge between P 1 and P 2 and act as a semaphore: a pebble can move one step around the PCR and across the edge of f only when the other pebbles are present next to the edge in the other PCR (i.e., lc(f ) is in the MDS), as shown in b), and all pebbles move across the edge at the same time.c) The position of the pebbles for the I-move f | 1 : bit 0 is set but not bit 1, so the pebbles are on 0f and f 1 (left side of the edge of f ).The top pebble can move across the edge, counter-clock-wise, while the lower one stays still.For I-move f | 2 with bit 0 unset and bit 1 set, the pebbles would be on 1f and f 0, on the right side of the edge of f .d) If F-moves f and g have a PCR P in common, then, because F-moves act like semaphores, it is not possible to do the F-move f twice before g is done once.For the pebble to go around P to do f a second time, necessarily the F-move g was done as well.(2,4).The outer circle is C = (f 0 , . . ., f 7 ), a cycle of length (2,4) and it contains M 0 and M .C = (f 0 , . . ., f n−1 ) is a list of F-moves such that M i+1 = f i M i (indices taken modulo n).After doing F-move f 0 , the pebble on at least one PCR, say P 0 has moved.Because C is a cycle, by the time f n−1 is done, all pebbles are back on their respective starting spot.Meaning the pebble on P 0 went all the way around (possibly multiple times) P 0 .To move around P 0 with F-moves, the pebbles in the PCR adjacent to P 0 must have moved as well, and, by the time f n−1 is done, go around their respective PCRs.By transitivity, and because the de Bruijn graph is strongly connected, every pebble on every PCR has gone around its PCR after f n−1 is done.Because every node went around its PCR, this means that every one of the σ k−1 F-moves was done and n ≥ σ k−1 .
Conversely, because the F-move/hyperedge act as semaphores, it is not possible for a pebble on a PCR to do more rotations around its own PCR than the pebbles on the adjacent (by hyperedge) PCRs.To see this, consider the starting position of the pebble on PCR P 0 .For this pebble to start a second turn around P 0 , all of its left-companions must be back on their starting spot and also start a second turn around their own PCRs.This holds for all PCRs by transitivity.
Hence, in a cycle of the MDS graph, the pebbles of all PCRs go around the same number of times, say α, and the number of F-moves in the cycle C is n = ασ k−1 .

■
Proof Point 1, strongly connected.As in the previous proof, there exists a cycle C = (M 0 , . . ., M n−1 ) in G MDS , and its edges are (f 0 , . . ., f n−1 ) with M i+1 = f i M i .
We show that for any node M i of this cycle and any neighbor M of M i , reachable by an F-move or RF-move from M i , M and M i are in a cycle.If this holds, by transitivity of the relation "being in the same strongly-connected component", any pair of nodes in the component are in a cycle and the component is strongly-connected.
WLOG, let's prove it for M 0 (see Figure 5).It is a consequence of the commutativity of the F-moves (Lemma 1).Let M = f M 0 be a neighbor of M 0 for some f ̸ = f 0 .Because in a cycle all F-moves occur, there exists a first j ∈ hence it is also valid in M 1 , and recursively in , M 2 , . . ., M j .Therefore f commutes with f 0 , . . ., f j−1 and the series of F-move (f = f j , f 0 , . . ., f j−1 ) is another path from M 0 to M j+1 that is going through M .This path followed by the remainder of C from M j+1 back to M 0 is a cycle that includes both M 0 and M .
) be the chain of F-moves representing that cycle.Every distinct F-move occurs exactly α times in that chain.We show that the chain can be reordered so that the σ k−1 different F-moves occur at the first σ k−1 positions of the chain.
If it is not already the case that the first σ k−1 F-moves are distinct, there must be an F-move f that occurs twice in the list before an F-move g occurs for the first time.Let i < j be two indices which are the first two occurrences of f in the chain (i.e., f i = f j = f ), and such that j + 1 is the first occurrence of g (f j+1 = g).If any of the PCRs involved in the F-move f are also involved in the F-move g, then it is not possible to use f twice in C before using g (see Figure 4d).Therefore the PCRs involved in the F-moves f and g are distinct, and g must be a valid F-move just before the second of f as well.In other words, f j and f j+1 commute.
Repeated swapping of F-moves leads to the desired chain of F-moves with all σ k−1 distinct F-moves in the first positions, which induces a cycle of length σ k−1 containing M .■ Proof Point 5, σ k−1 -partite.Partition the nodes of a component of G MDS as follows.We create σ k−1 sets: P 0 , . . ., P σ k−1 −1 .Let M 0 be an arbitrary MDS of the component and assign it to the set P 0 .For every other MDS M , take a shortest path Because M 0 is in a cycle of length σ k−1 , every set P i has at least one MDS assigned to it.Moreover, every MDS is assigned to exactly one set.Hence the sets P i form a partition of the MDSs in the component.
An edge between MDSs in sets P i and P j with j > i + 1 would imply the existence of a cycle containing M 0 of length < σ k−1 , which is not possible.Proof.From Proposition 3, in any component there exists an MDS M ′ where f is a valid F-move.If there exists other valid F-moves than f in M ′ , do them recursively.I.e., we do every possible F-move in M ′ but refuse to do f .This creates a path P of MDSs in G MDS starting at M ′ that does not contain f as an edge.
Because every cycle in G MDS contains every possible F-move, P cannot induce a cycle, and it must terminate at an MDS M .By construction M is f -terminal.

■
An f -terminal MDS M has a useful property: every maximal path in D k that avoids M (as created by a walk like in Proposition 1) must start at a k-mer m ∈ rc(f ).Equivalently, any walk in D k that avoids M following edges backward ends at some m ∈ rc(f ).M 1 and M 2 are in different components, so they are distinct MDSs and there exists a PCR R where the selected k-mer is different.That is, R ∩ M 1 ≜ m 1 ̸ = m 2 ≜ R ∩ M 2 .Take a path P in D k following edges backward from node 0f (which is in both M 1 and M 2 ) to m 1 that avoids nodes af, a ∈ Σ \ {0}.Path P exists because D k is (σ − 1)-connected.Because m 1 ∈ M 1 ∆ M 2 , there must exist a first node m ∈ P which is in M 1 ∆ M 2 .
Let P 1 be the restriction of the path P from 0f to m and, WLOG, assume that m ∈ M 1 .By construction, Let P 2 be a path created by a maximal random walk in D k , following edges backward, starting from m and that avoids M 2 .Because M 2 is f -terminal, the walk ends at a node f a ∈ rc(f ), a ∈ Σ.By construction, |P 2 ∩ M 1 | ≥ |P 2 ∩ M 2 | = 0 (P 2 avoids nodes from M 2 but may contain nodes from M 1 ).
Two cases can happen.First case, there exists a first node m ′ ∈ P 1 ∩ P 2 .Then define the cycle C as the restriction of P 1 from m ′ to m followed by the restriction of P 2 from m to m ′ .Second case, P 1 ∩ P 2 = ∅ and define the cycle C as the concatenation of P 1 , P 2 and backward edge f a → 0f .
In both cases, C satisfies by construction H M 1 (C) > H M 2 (C).From M recursively do all valid F-moves except for the F-moves g c where m c = 0 to obtain M ′ ∈ χ where the only valid F-moves are exactly those than we refused to do.There must exist a ∈ Σ such that m a = 1 and af / ∈ M ′ , otherwise f | m is a valid I-move in M ′ (see Figure 7b).From af do a walk that avoids M ′ using backward edges.This walk must end at one of the rightcompanions of the valid F-moves in M ′ , that is there exists b such that walk ends at m ′ ∈ rc(g b ).

Figure 1 :
Figure1: a) For f ∈ Σ k−1 , the left-companions (k-mers 0f and 1f for the binary alphabet) and right-companions (f 0 and f 1) induce a directed complete bipartite K σ,σ .When the left-companions are in the set (left subgraph, highlighted in gray), an F-move replaces these nodes with the rightcompanions (right subgraph).An RF-move is the reverse operation, replacing the right-companions with the left-companions.b) When one k-mer is a homopolymer, the induced subgraph is slightly different, but the F-moves and RF-moves are defined analogously.c) One of the possible I-move,f | 1, where a mixture of left-and right-companions are in the set.d) The other possible I-move, f | 2. For any f ∈ Σ k−1 there are 1 F-move, 1 RF-move and 2 σ − 2 I-moves possible, unless f is a homopolymer.

Figure 2
Figure 2: a) MDS graph G MDS(2,4) with edge labels as numbers in [0, σ k−1 ] representing the Fmoves.There are 3 components.Each component is strongly connected and can be partitioned into σ k−1 = 8 layers with edges only from one layer to the next.The gray vertical boxes in the middle component highlight the layers, numbered from 0 to σ k−1 .Each layer in the middle component has size 1 or 2.An example of a cycle of length 8 with every F-move done exactly once is highlighted with dashed edges.b) Example of 2 components of non-decycling PCR sets.The components are DAGs with a longest path less than 8 edges.
and let χ be a component of G MDS .Then f | m is not a valid I-move in any MDS of χ if and only if ∃a, b such that m a = 1, m b = 0 and there exist a constrained cycle using the edge af → f b.

Figure 3 :
Figure3: Left: If a longest path does not start at a valid F-move f , i.e., one of the left-companion of f in solid gray is missing, then it could be extended to the left, contradicting maximality.Doing F-move f (changing solid gray for dashed nodes) can shorten the longest path by 1 node.Also, after doing F-move f , a path now ending in one of the solid gray node could be the longest and was extended by 1 node.If the path goes through an I-move f ′′ | m , then doing the I-move cuts the path in two possibly equal parts.Right: Comparison of the minimum and maximum remaining longest path for components of G MDS (2, k) for 4 ≤ k ≤ 8.Each point represents one connected component of the graph.The minimum and maximum remaining path lengths are computed over all the MDSs of a component.Therefore, the vertical distance of a point from the diagonal y = x (in yellow) shows the variation of remaining path length within a component.For k = 8, a subsample of 500 components were examined, as the total number of components is exceedingly large.The lines are drawn to depict the bounds of the increase between components.In all cases seen, the difference between the minimum and maximum remaining length within a component is in some range [α, α + k] for an alpha that is less than k.

■ 3 Lemma 2
Cycle signature unique per componentAn MDS M is called f -terminal if the only valid F-move in M is f .For any f ∈ Σ k−1 and in any component of G MDS , there exists an f -terminal MDS.

Proposition 4 1 .Figure 6 :
Figure 6: Simplified example for finding the complementary I-moves, when σ = 2. On the left box, component χ 1 and component χ 2 on the right, of G MDS .The cycle C 1 , C 2 are cycles in χ 1 and χ 2 respectively.The simplified PCR/pebble drawings represent the position of the pebbles on the PCRs of P m (top PCR) and P m (bottom PCR).The edge between these PCRs represents f .The PCR/pebbles drawings next to the MDS nodes represent the state of the PCRs for these MDSs, while the drawings next to the F-move lists represent the action of the list of F-moves on the pebbles.From the cycle C 1 in χ 1 , we construct cycle C 2 in χ 2 by swapping the order of the F-moves: (F m , f, F m ) → (F m , f, F m ).These cycles go through the desired MDSs M ′ 2 and M ′ 1 that are linked by the complementary I-move f | m .

■ 4 G
comp is undirected Proposition 5 (G comp is undirected) Let f | m be a valid I-move from MDS M 1 in component χ 1 to M 2 in χ 2 .Then there exists M ′ 2 , M ′ 1 in χ 2 , χ 1 , respectively, such that f | m (where m is the bit-complement of m) is a valid I-move from M ′ 2 to M ′ 1 .

Figure 7 :
Figure 7: a) The f | m with m a = 1 and m b = 0 is not possible because H M (C) = 1.When the I-move f | m is valid, necessarily C's hitting number must be at least 2. b) Suppose f | m is never valid, then a backward walk creates a cycle with hitting number 1 using the edge af → f b.
By construction there is a backward edge m ′ → f b.Then follow the backward edge f b → af to create a cycle C.By construction the only node from M ′ in cycle C is f b, hence H M ′ (C) = 1 and C uses the edge af → f b with m a = 1 and m b = 0.

Table 1 :
[8,10] and G MDS properties for σ = 2. "Layer range" gives, when possible, the range of the number of MDSs in each layer of G MDS .The numbers for k ≤ 7 are exact, computed from the exhaustive list of MDSs.For columns k ∈[8,10], the number of components is correct provided the conjectures are correct, otherwise the numbers provided are under-estimations.For k = 8, the layer size and number of MDSs are estimated by sampling 100 random components.For k = 9, the numbers are likely severe under-estimations.For k = 10, computation is too expansive.

Table 2 :
The remaining path length for the Mykkeltveit and Champarnaud sets compared to the range of remaining path length.For σ = 2 and k ≤ 7 (underscored), the range of remaining path length is computed exactly from the exhaustive list of MDSs.All other values are estimated using a simulated annealing (SA) algorithm.